Engineering leader at Posit, PBC.
15+ years building data tools in R and Python for scientific computing.
PMC member and maintainer of Apache Arrow; author of dittodb.
Experienced with large-scale data analysis, modeling, and enterprise tools.
Applied statistician at Highmark Health and Utah State University
15+ years of R programming and package development experience
Maintainer of data.table and 3 other R packages
Consultant on NSF grant supporting data.table infrastructure
Works with large datasets (millions of rows, hundreds of columns)
Associate Professor of Statistics and Data Science at Cal Poly
Co-author of R packages flair and tidyclust
Consultant on NSF grant supporting data.table infrastructure
Research experience with high-volume, in-memory data
Make sure you have the most recent version of R. (Recommended minimum: R 4.0.)
Make sure you have the most recent version of RStudio. (Recommended minimum: 2025 release.)
(Positron is probably fine, but we haven’t stress-tested that.)
(Optional) Download all workshop materials.
(Also available as “installs.R” in the workshop materials repository.)
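The authoritative package list is in “installs.R”; based on the tools covered in this workshop, the setup step is likely along these lines:

```r
# A sketch of the setup script. See "installs.R" in the workshop
# materials repository for the authoritative list of packages.
install.packages(c(
  "data.table", # fast in-memory data manipulation
  "arrow",      # larger-than-memory datasets
  "duckdb",     # in-process analytical database
  "duckplyr",   # dplyr backend powered by DuckDB
  "dplyr"
))
```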
This subset includes only the person-level data, only for the years 2005, 2018, and 2021, and only for the states Alaska, Alabama, Arkansas, Arizona, California, Washington, Wisconsin, West Virginia, and Wyoming.
Download and unzip it into a directory called data in your working directory, and you can run the examples in the workshop.
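Once the subset is unzipped into the data directory, opening it with arrow might look like the sketch below. This assumes the files are in a format `open_dataset()` can read (such as Parquet) and that the table has year and state (ST) columns; the actual layout and column names may differ.

```r
library(arrow)
library(dplyr)

# Point arrow at the unzipped subset; nothing is read into memory yet.
pums <- open_dataset("data/")

# Queries build up lazily and are only materialized by collect().
pums |>
  filter(year == 2021) |>
  count(ST) |>
  collect()
```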
We also host a full version of the dataset in AWS S3.
Once you have set up your AWS account and the AWS CLI, download the data into a data directory:
aws s3 cp --recursive s3://scaling-arrow-pums/ ./data/
This is the full dataset, but it does require setting up the AWS CLI and waiting for the download to complete.
For this workshop:
Most analysis of PUMS data starts with subsetting: by state (or an even smaller geography), by year, and often both. But with the tools we learn in this workshop, we can analyze the whole dataset.
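As a hypothetical illustration of that first subsetting step, here is what it might look like in data.table. The column names (ST, year, PWGTP) are assumptions based on the PUMS subset described above, and the table here is a toy stand-in, not the real data.

```r
library(data.table)

# Toy stand-in for the person-level PUMS table (not the real data).
pums <- data.table(
  ST    = c("WA", "WI", "CA", "WA"),
  year  = c(2021, 2018, 2021, 2005),
  PWGTP = c(25, 40, 31, 18)  # person weight
)

# Subset by state and year in one pass, then a weighted person count.
pums[ST == "WA" & year == 2021, .(people = sum(PWGTP))]
```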
Though we have not purposefully altered this data, it should not be relied on as a perfect, or even accurate, representation of the official PUMS dataset.
Help you recognize when you need a speed-up and which tools will help.
Get you off the ground using data.table for faster operations on large data fully in R.
Show you how to set up a duckdb database and use arrow and duckplyr to partition your analysis.
Give you a unified workflow that combines these tools.